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ABSTRACT 

Although  there  have  been  attempts  to  develop  code  trans¬ 
formations  that  yield  tamper-resistant  software,  no  reliable 
software-only  methods  are  known.  This  paper  studies  the 
hardware  implementation  of  a  form  of  execute-only  mem¬ 
ory  (XOM)  that  allows  instructions  stored  in  memory  to 
be  executed  but  not  otherwise  manipulated.  To  support 
XOM  code  we  use  a  machine  that  supports  internal  comp¬ 
artments — a  process  in  one  compartment  cannot  read  data 
from  another  compartment.  All  data  that  leaves  the  ma¬ 
chine  is  encrypted,  since  we  assume  external  memory  is  not 
secure.  The  design  of  this  machine  poses  some  interesting 
trade-offs  between  security,  efficiency,  and  flexibility.  We 
explore  some  of  the  potential  security  issues  as  one  push¬ 
es  the  machine  to  become  more  efficient  and  flexible.  Al¬ 
though  security  carries  a  performance  penalty,  our  analysis 
indicates  that  it  is  possible  to  create  a  normal  multi-tasking 
machine  where  nearly  all  applications  can  be  run  in  XOM 
mode.  While  a  virtual  XOM  machine  is  possible,  the  un¬ 
derlying  hardware  needs  to  support  a  unique  private  key, 
private  memory,  and  traps  on  cache  misses.  For  efficient  op¬ 
eration,  hardware  assist  to  provide  fast  symmetric  ciphers  is 
also  required. 

1.  INTRODUCTION 

Software  piracy  is  a  significant  economic  problem.  The 
Business  Software  Alliance,  for  example,  estimates  that  pira¬ 
cy  cost  the  software  industry  11  billion  dollars  in  1998  [1]. 
In  addition  to  legislative  and  law  enforcement  efforts,  it 
appears  useful  to  develop  technical  methods  for  combat¬ 
ing  software  piracy.  In  principle,  a  software  vendor  might 
like  to  sell  a  single  copy  to  a  single  user.  However,  it  is 
generally  easy  to  copy  and  distribute  digital  information. 
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While  software  watermarking  might  aid  in  prosecuting  pi¬ 
rates  once  they  are  detected,  methods  aimed  at  identifying 
copies  will  not  prevent  copying  itself.  We  therefore  investi¬ 
gate  a  method  for  preventing  unauthorized  execution  of  soft¬ 
ware,  regardless  of  the  number  of  copies  made.  Our  method 
also  prevents  any  software  customer  from  examining  the  ex¬ 
ecutable  code  itself,  thereby  protecting  the  algorithms  and 
computational  methods  incorporated  into  the  code.  . 

The  protection  model  studied  in  this  paper  is  called 
“XOM”,  pronounced  “zom”,  an  acronym  for  eXecute-Only 
Memory.  The  mechanism,  also  described  in  [4]  and  relat¬ 
ed  to  concepts  presented  in  [2,  9,  10,  11],  is  based  on  the 
idea  that  code  stored  on  disk  or  other  media  can  be  marked 
“execute-only.”  Execute-only  code,  which  is  stored  in  an 
encrypted  form,  can  only  be  decrypted  by  the  instruction- 
loading  path  on  the  main  processor  chip,  thereby  preventing 
any  user  of  the  computer  from  examining  the  actual  instruc¬ 
tions.  This  simple  idea  requires  additional  mechanisms  in 
order  to  preserve  code  security.  In  particular,  data  written 
during  execution  must  be  encrypted,  in  order  to  prevent  an 
adversary  from  reverse-engineering  the  code.  This  is  espe¬ 
cially  tricky  for  data  stored  in  registers  that  are  accessible 
to  the  operating  system  after  an  interrupt,  since  encrypting 
every  write  to  a  register  and  decrypting  every  read  from  a 
resister  is  prohibitively  slow.  In  addition,  merely  encrypt¬ 
ing  data  may  not  prevent  sophisticated  attacks  that  involve 
permuting  the  contents  of  encrypted  data  locations. 

The  aim  of  this  paper  is  to  establish  that  hardware  im¬ 
plementation  of  a  XOM-like  execution  model  is  feasible  in 
practice.  To  support  a  secure  execution  environment,  we 
use  a  form  of  compartment  to  isolate  independent  software 
applications  running  on  the  same  processor  [23].  Each  com¬ 
partment  is  built  from  a  session  key^  used  to  encrypt  the 
associated  data.  Regular  unencrypted  code  can  run  in  the 
unprotected  or  null  compartment  that  has  no  key  associated 
with  it. 

Protection  is  provided  using  a  combination  of  public-key 
and  symmetric-key  cryptography  [19].  In  brief,  each  XOM 
chip  contains  the  private  decryption  key  of  a  public-key  en¬ 
cryption  pair.  The  corresponding  encryption  key  is  made 
public,  so  that  anyone  can  encrypt  code  for  this  chip  but 
only  the  chip  itself  contains  the  key  required  for  decryption. 
However,  if  the  entire  instruction  stream  of  a  XOM  appli¬ 
cation  were  encrypted  using  public-key  cryptography  (such 
as  RSA  [19]),  instruction  loading  would  be  prohibitively 
slow.  Therefore,  the  header  block  of  each  XOM  applica- 
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tion  contains  the  encryption  key  of  some  faster,  symmetric 
key  encryption  scheme  (such  as  DES  [20])  and  the  remainder 
of  the  application  is  encrypted  with  this  key.  Since  each  ap¬ 
plication  will  have  a  distinct  symmetric  key,  the  symmetric 
key  embedded  in  an  application  can  be  used  as  the  identifier 
for  the  compartment  in  which  the  code  is  executed. 

The  next  section  of  the  paper  (Section  2)  describes  the  ab¬ 
stract  XOM  machine  architecture  in  more  detail,  explaining 
how  the  machine  internally  protects  information  in  compart¬ 
ments.  We  begin  with  a  description  of  a  basic  machine  that 
does  not  use  main  memory  and  does  not  take  interrupts. 
We  then  extend  this  model  to  support  both  of  those. 

Section  3  looks  at  some  of  the  security  issues  with  XOM, 
and  addresses  various  ways  that  information  might  leak  out 
of  the  compartments  that  we  have  constructed.  Our  securi¬ 
ty  model  does  not  trust  external  memory  or  the  operating 
system  that  manages  the  execution  of  the  XOM  program. 
Providing  guarantees  in  such  a  model  is  a  particularly  chal¬ 
lenging  problem. 

Section  4  describes  poSvSible  implementations  of  a  XOM 
processor.  We  abstract  many  functions  into  a  XOM  Virtual 
Machine  Monitor  that  could  be  implemented  in  microcode 
or  software.  Section  5  discusses  hardware  necessary  to  make 
the  performance  of  XOM  reasonable  and  Section  6  evaluates 
the  overhead  of  this  hardware.  Finally,  we  close  the  paper 
with  a  summary  of  the  XOM  architecture  in  Section  7. 

2.  THE  ABSTRACT  MACHINE 

The  abstract  XOM  machine  has  three  principal  tasks:  de¬ 
coding  the  session  key  using  an  asymmetric  cipher,  decoding 
the  instruction  stream  from  external  memory  using  the  ses¬ 
sion  key,  and  providing  compartmentalized  storage  for  the 
XOM  code  to  use.  The  most  interesting  task  is  providing 
this  storage,  so  we  will  begin  with  that  problem. 

Our  basic  approach  to  secure  storage  is  to  tag  all  data 
with  a  XOM  identifier.  This  identifier  is  shorthand  for  a 
session  key  and  is  an  index  into  a  table,  called  the  session 
key  table^  that  maps  XOM  identifiers  to  a  decrypted  session 
key.  Programs  that  run  in  the  clear  without  encrypting  their 
code  belong,  by  default,  to  the  null  principal,  and  have  a 
XOM  identifier  of  zero. 

The  size  of  the  sesvsion  key  table  and  tags  depend  on  the 
number  of  concurrently  executing  principals  that  can  have 
data  in  the  machine.  In  the  simplest  machine,  the  identifier 
can  be  one  bit  and  the  table  contains  only  two  entries:  the 
session  key  of  the  currently  active  XOM  program,  and  the 
null  session. 

At  any  time,  only  one  principal  is  executing  and  thus,  on¬ 
ly  one  XOM  identifier  is  active.  We  refer  to  this  principal 
as  the  “active  principal”.  The  session  key  and  the  corre¬ 
sponding  XOM  identifier  belonging  to  the  active  principal 
are  called  the  “active  key”  and  the  “active  XOM  identifier”. 
When  data  is  produced  by  the  program,  the  abstract  ma¬ 
chine  automatically  tags  it  with  the  active  XOM  identifier. 
When  data  is  read,  the  tag  on  the  data  is  compared  with  the 
active  XOM  identifier.  If  the  comparison  succeeds,  the  read 
is  allowed,  otherwise  the  read  causes  an  exception.  Thus,  no 
principal  can  access  the  data  of  another  principal — the  tags 
create  the  compartments  that  provide  the  isolation  required 
for  security. 

In  addition  to  protecting  the  data,  the  abstract  machine 
provides  two  instructions:  enter.xom  and  exiLxom.  XOM 
code  is  preceded  by  an  enter. xom  instruction,  where  the 


source  register  holds  the  starting  memory  address  of  the  en¬ 
crypted  session  key  for  this  XOM  code.  The  instruction 
indicates  to  the  XOM  machine  that  all  following  code  be¬ 
longs  to  a  principal  associated  with  the  session  key.  The 
machine  checks  to  sec  if  the  session  key  has  already  been 
decoded.  If  the  session  key  is  already  in  the  table,  it  sets 
the  active  identifier  to  that  entry  and  starts  to  fetch  XOM 
code.  If  no  entries  in  the  session  key  table  match,  the  ma¬ 
chine  chooses  a  free  entry,  sets  the  active  identifier  to  this 
entry,  runs  the  asymmetric  decryption  algorithm  on  the  key 
and  then  enters  the  key  pair  in  the  table  before  fetching  the 
first  XOM  instruction. 

While  in  XOM  mode,  all  instructions  are  decrypted  us¬ 
ing  the  session  key  before  they  are  placed  in  the  instruction 
stream  for  execution.  Other  than  decrypting  the  instruc¬ 
tions  and  tagging  the  data,  the  machine  operates  like  a  con¬ 
ventional  machine.  There  are  two  kinds  of  events  that  will 
cause  the  active  identifier  to  change.  The  normal  event  is 
the  execution  of  an  exit.xom  instruction.  This  instruction 
changes  the  active  identifier  back  to  null,  and  the  machine 
stops  decrypting  instructions.  The  “abnormal”  event  occurs 
when  a  trap  or  interrupt  is  taken.  In  this  case,  an  implicit 
exit.xom  instruction  is  executed  before  the  instructions  from 
the  handler  are  executed. 

To  complete  the  abstract  machine  we  need  two  addition¬ 
al  instructions  to  allow  communication  between  principals. 
This  communication  is  provided  by  the  mv.to.null  and 
mv-from.null  instructions.  These  instructions  allow  a  con¬ 
trolled  way  to  change  the  tags  associated  with  a  piece  of 
data.  The  mv-to.null  instruction  takes  data  that  is  tagged 
with  the  active  XOM  identifier  and  changes  the  tag  on  it  to 
the  null  principal.  After  this  instruction  is  executed,  access 
to  the  data  by  the  original  principal  results  in  an  excep¬ 
tion.  Executing  this  instruction  on  data  that  is  not  original¬ 
ly  tagged  with  the  currently  executing  principal  also  results 
in  an  exception.  The  mv.from.null  instruction  changes  data 
tagged  by  the  null  principal  to  the  active  XOM  tag.  Once 
again  data  has  to  be  originally  tagged  with  null  before  this 
instruction  can  be  executed. 

These  instructions  and  the  semantics  of  tagging  guarantee 
that  when  a  principal  reads  data  it  will  only  get  values  that 
were  either  created  by  itself  or  explicitly  brought  into  its 
compartment.  The  simplicity  of  this  compartment-based 
protection  method  appeals  to  us — the  basic  tag  approach 
has  no  special  cases  and  does  not  depend  on  the  specifics  of 
the  processor  execution  model. 

In  summary,  the  abstract  machine  described  thus  far  pro¬ 
vides  the  following  four  mechanisms. 

1.  A  scheme  to  decrypt  symmetric  keys  using  the  private 
half  of  a  public  key  pair. 

2.  Facilities  for  decoding  the  program  code  using  a  de¬ 
crypted  session  key. 

3.  Instructions  for  entering  and  exiting  XOM  mode,  with 
traps  and  interrupts  causing  an  implicit  exit  from 
XOM  mode. 

4.  A  data  tagging  scheme  that  prevents  principals  from 
accessing  data  belonging  to  other  principals. 

These  mechanisms  would  be  sufficient  if  either  external 
memory  was  secure,  or  we  did  not  need  to  use  it.  Since  we 
acknowledge  that  memory  can  be  probed,  and  that  it  will 


be  necessary  to  use  external  memory  in  most  cases,  we  will 
need  to  deal  with  its  insecurity.  We  again  use  cryptography, 
this  time  to  extend  the  compartments  to  external  memory. 

2.1  External  Memory 

To  extend  compartments  to  external  memory,  we  encrypt 
the  tagged  data  using  the  appropriate  session  key  when  it 
leaves  the  confines  of  the  abstract  machine.  Unfortunately, 
this  is  not  sufficient  to  provide  the  same  guarantees  that  the 
internal  tags  provide — we  still  need  a  mechanism  to  prevent 
access  to  data  that  was  tampered  with  while  it  was  in  mem¬ 
ory.  To  allow  the  machine  to  check  the  ownership  as  well  as 
protect  the  data,  a  secure  hash  is  associated  with  all  values 
stored  in  memory.  If  an  external  agent  tampers  with  the 
data,  then  the  hash  will  not  match  and  the  instruction  will 
cause  an  exception. 

To  support  the  needed  functionality,  a  XOM  machine  pro¬ 
vides  two  pairs  of  instructions  to  move  data  between  exter¬ 
nal  memory  and  the  machine:  storesecure/ loadsecure  and 
store-from-null/  load-from-null 

The  store^secure  and  loadsecure  act  like  normal  load  and 
store  instructions  for  the  XOM  process.  They  are  used  by 
the  currently  executing  XOM  program  to  move  data  be¬ 
tween  memory  and  registers  that  are  in  the  active  compart¬ 
ment.  Thus,  if  the  register  named  in  a  store^secure  is  tagged 
with  an  identifier  other  than  the  active  one,  the  instruction 
raises  an  exception.  Similarly,  if  the  load-secure  memory  da¬ 
ta  hash  does  not  match  the  data  the  instruction  will  cause 
an  exception.  The  store-secure  instruction  encrypts  data 
with  the  active  session  key  and  creates  a  hash  to  check  for 
ownership.  The  loadsecure  instruction  takes  a  destination 
register  and  memory  location  as  arguments.  It  decrypts  the 
contents  of  memory  using  the  active  session  key,  verifies  the 
hash,  loads  the  decrypted  value  into  the  register  and  changes 
the  tag  on  the  register  to  the  active  identifier.  If  there  is  a 
mismatch  in  the  hash,  the  instruction  will  cause  an  excep¬ 
tion. 

The  store-from-null  and  load-from-null  instructions  are 
additional  instructions  to  make  it  possible  for  a  XOM  pro¬ 
gram  to  read  and  write  data  in  the  null  compartment.  These 
instructions  neither  encrypt  or  decrypt  their  data,  nor  have 
secure  hashes  associated  with  them.  They  behave  just  as 
load  and  store  instruction  do  on  a  conventional  processor. 
In  a  XOM  machine  these  instructions  do  not  change  the 
ownership  of  the  data,  and  so  the  data  to  be  stored  on  a 
store-from-null  instruction  must  be  tagged  with  the  null 
identifier,  and  the  data  left  by  a  load-from-null  is  tagged 
with  the  null  identifier.  Again,  a  store-from-null  on  a  regis¬ 
ter  that  is  not  tagged  with  null  results  in  an  exception. 

As  we  will  describe  in  the  implementation  section,  to  re¬ 
duce  both  the  time  and  bit  overhead  of  memory  operations, 
the  encryption  and  hash  generation  are  pushed  through  the 
memory  hierarchy,  so  it  is  only  done  when  the  data  leaves  the 
processor  and  enters  insecure  memory.  The  on-chip  caches 
use  tags,  just  like  the  registers,  to  provide  access  control. 

2.2  Supporting  Interrupts 

To  allow  XOM  code  to  be  interrupted  and  restarted  we 
need  to  remove  any  dependency  between  the  operating  sys¬ 
tem’s  resource  management  responsibilities  and  compart¬ 
ment  security.  On  an  interrupt,  an  untrusted  operating  sys¬ 
tem  must  be  allowed  to  save  the  register  state  of  a  XOM 
process  without  being  able  to  interpret  or  leak  the  contents 


of  the  registers.  The  current  machine  does  not  have  this  ca¬ 
pability,  since  a  process  can  access  protected  registers  only 
if  the  contents  of  the  register  is  in  its  compartment.  We 
need  to  add  two  more  instructions — the  saves ecure  and  re- 
store-secure  instructions — to  provide  this  ability. 

The  savesecure  and  restores  ecure  instructions  are  used 
by  the  currently  executing  program  to  move  data  that  it 
does  not  own  and  does  not  belong  to  the  null  principal.  The 
savesecure  instruction  takes  the  contents  of  a  register  and 
creates  an  encapsulated  version  of  this  data  that  another 
principal  can  move  but  cannot  manipulate.  It  first  encrypts 
the  register  contents  with  the  key  of  the  register  owner,  and 
then  calculates  a  hash  that  includes  the  identity  of  the  regis¬ 
ter.  It  places  the  encrypted  data,  hash,  and  the  XOM  iden¬ 
tifier  into  a  set  of  special  registers,  which  are  now  owned  by 
the  currently  active  session.  This  data  can  then  be  stored 
to  memory  if  needed. 

The  restoresecure  instruction  is  the  inverse  of  the  savese¬ 
cure  operation  and  it  is  used  to  restore  the  data  back  to  the 
same  register.  The  instruction  uses  the  special  registers  that 
hold  the  encrypted  data,  hash,  and  destination  key  identifi¬ 
er.  It  uses  the  destination  key  to  decrypt  the  data  and  check 
the  hash.  If  the  hash  matches  both  the  decrypted  data  and 
destination  register,  the  decrypted  data  is  written  into  the 
destination  register  and  the  tag  is  set  to  the  destination  key 
identifier.  In  this  way,  we  ensure  that  the  register  contents 
are  not  altered  and  that  the  values  are  restored  back  to  the 
same  register  from  which  they  were  saved. 

The  savesecure  and  restoresecure  instructions  package 
up  protected  data,  allowing  another  principal  to  move  the 
data  around  without  allowing  them  to  tamper  with  it.  This 
provides  an  operating  system  the  means  to  schedule  XOM 
processes  without  violating  the  security  of  our  compart¬ 
ments.  We  describe  the  implementation  of  these  functions 
in  more  detail  in  Section  4. 

3.  SECURITY  ISSUES 

Any  system  may  be  the  target  of  a  wide  range  of  security 
attacks.  While  the  ultimate  attack  is  one  that  directly  caus¬ 
es  the  secrets  to  be  revealed,  it  is  more  often  the  case  that 
several  weaker  attacks  are  combined  to  achieve  the  same 
goal.  To  this  end,  an  adversary  may  try  to  manipulate  the 
target  in  such  a  way  as  to  leak  information  about  the  hid¬ 
den  secret.  With  this  additional  information,  an  adversary 
can  constrain  her  search  space  and  eventually  mount  an  ex¬ 
haustive  search.  Since  we  expect  our  model  to  work  in  the 
presence  of  untrusted  external  memory,  we  must  assume 
that  an  adversary  will  tamper  with  the  values  stored  in  mem¬ 
ory.  We  first  discuss  three  potential  attacks  that  can  arise 
in  this  context:  spoofing,  splicing,  and  replay,  and  then  look 
at  other  ways  a  XOM  machine  can  leak  information, 

A  spoofing  attack  is  where  an  adversary  generates  data 
and  tries  to  pass  it  off  as  valid  data.  Such  an  attack  against 
XOM  would  involve  replacing  values  in  memory,  including 
instructions  or  data  values,  with  spurious  ciphertext.^  If  the 
XOM  machine  blindly  accepted  these  spurious  values  and 
operated  on  them,  it  may  alter  the  behavior  of  the  XOM 
program  in  such  a  way  that  information  about  the  copy¬ 
protected  code  is  revealed.  The  usual  cryptographic  solution 
to  a  spoofing  attack  is  to  employ  a  Message  Authentication 

^Ciphertext  is  the  term  assigned  to  encrypted  data.  Like¬ 
wise,  plaintext  is  any  data  that  is  unencrypted. 


Code  (MAC)  [24,  13,  15].  A  MAC  is  a  keyed,  one-way  hash 
of  the  message.  The  hash  is  easily  reproduced  to  check  for 
authenticity,  but  it  is  difficult  to  find  another  message  that 
hashes  to  the  same  value.  Thus,  the  store^sexure  instruc- 
tion  generates  a  MAC  of  the  encrypted  data  and  saves  it 
along  with  the  data  in  external  memory  as  mentioned  ear¬ 
lier.  When  the  value  is  read  back  in,  via  the  load^secure  in¬ 
struction,  the  data  is  checked  with  its  accompanying  MAC 
for  authenticity.  Execution  is  halted  if  the  MAC  cannot  be 
verified.  Since  the  adversary  cannot  generate  a  valid  MAC, 
spoofing  values  in  memory  is  impossible. 

Splicing  attacks  involve  taking  valid  fragments  of  valid  ci¬ 
phertext  (in  our  case,  portions  of  XOM  code  or  data)  and 
reordering  or  duplicating  them  at  different  locations.  The 
intended  goal  of  this  type  of  attack  is  also  to  trick  the  ma¬ 
chine  into  executing  the  modified  XOM  program  in  the  hope 
that  it  will  reveal  some  secret  about  the  original  XOM  pro¬ 
gram. 

To  prevent  this  type  of  attack,  the  MAC  used  in  the  ab¬ 
stract  machine  includes  a  position  dependent  attribute  with¬ 
in  it.  In  the  case  of  data  stored  in  memory  and  instructions, 
we  include  the  virtual  address  of  the  memory  location.  Like¬ 
wise,  in  the  case  of  register  data,  we  include  the  register 
number.  During  both  instruction  and  data  fetch  from  ex¬ 
ternal  memory,  the  MAC  of  the  fetched  data  is  checked  to 
ensure  that  the  data  has  not  been  tampered  with.  If  the 
MAC  does  not  match,  the  machine  will  take  an  exception 
and  the  XOM  application  will  halt. 

The  XOM  architecture  also  addresses  replay  attacks  where 
the  adversary  records  previous  valid  ciphertexts  and 
re-inputs  them  to  the  XOM  machine  at  a  later  time.  Since 
the  code  is  static,  this  approach  only  attacks  the  data  val¬ 
ues.  The  ability  to  interrupt  the  XOM  process  gives  an  ad¬ 
versary  access  to  many  valid  register  ciphertexts.  Likewise, 
the  ability  to  watch  the  memory  traffic  allows  the  adversary 
to  replay  memory  values.  To  remedy  this  problem,  we  avS- 
sociate  a  mutating  register  with  each  XOM  identifier  and 
place  this  in  the  session  key  table.  This  register  is  updated 
each  time  the  XOM  process  is  interrupted.  Including  this 
value  in  the  hash  used  for  the  savesecure  and  restoresecure 
commands  prevents  the  register  values  from  being  replayed, 
since  values  from  a  previous  interrupt  will  not  match  the 
current  mutating  register  value.^  Safe  register  values  can 
then  be  used  to  protect  critical  memory  data. 

Aside  from  guarding  against  spoofing,  splicing  and  replay 
attacks,  our  architecture  also  guarantees  that  XOM  code 
intended  for  one  machine  may  not  be  executed  on  another 
machine  with  a  different  key.  Thus,  the  software  is  copy- 
resistant.  We  accomplish  this  without  trusting  any  other 
entity  other  than  the  XOM  machine  itself.  In  particular,  we 
do  not  rely  on  the  security  of  the  operating  system  or  the 
memory. 

However,  there  are  also  some  limitations  of  this  execution 
model.  Since  the  abstract  machine  is  trusted,  a  buggy,  ma¬ 
licious,  or  a  compromised  abstract  machine  can  reveal  the 
secrets  of  a  XOM  program,  allowing  it  to  be  copied  or  mod¬ 
ified.  We  believe  that  the  simplicity  of  the  internal  tagging 
mechanism  will  reduce  the  probability  of  errors,  but  it  is  still 

^A  side  effect  of  this  is  that  the  XOM  machine  cannot  dif¬ 
ferentiate  between  a  fork,  wffiich  is  a  legitimate  case  where 
memory  values  must  be  duplicated,  and  a  real  replay  at¬ 
tack.  Our  current  research  is  looking  at  methods  that  allow 
applications  to  fork,  while  still  preventing  replay  attacks. 


an  issue.  Furthermore,  we  assume  that  the  data  on-chip 
is  secure,  although  there  are  techniques  that  allows  one  to 
probe  a  chip  using  internal  scan  chains,  powder  analysis  [6,  14] 
or  various  other  more  exotic  techniques  [21].  In  a  XOM  pro¬ 
cessor  this  can  be  solved  by  disabling  internal  sccin  chains 
and  other  test  hardware,  or  forcing  them  obey  the  same  tag 
access  rules  as  the  rest  of  the  hardw^are,  and  packaging  chips 
in  a  way  that  makes  probing  impossible. 

If  tw'o  applications  wish  to  share  data,  they  must  negoti¬ 
ate  a  common  shared  key  through  standard  cryptographic 
methods.  Two  XOM  programs  may  not  share  a  session  key. 
Sharing  a  key  would  enable  an  adversary  to  splice  instruc¬ 
tions  from  the  two  programs  in  unauthorized  w^ays. 

Finally,  our  model  also  leaks  information  at  the  external 
memory  interface.  An  adversary  can  w^atch  the  memory 
traffic  and  determine  an  address  trace  of  the  XOM  pro¬ 
gram.  The  coarseness  of  this  address  trace  will  vary  with 
the  amount  of  caching  used  in  an  implementation  of  the  ab¬ 
stract  machine.  Whether  this  information  leakage  can  be 
effectively  exploited  is  currently  an  open  question. 

4.  IMPLEMENTATION  OF  XOM 

There  are  many  hardware  and  software  tradeoffs  in  imple¬ 
menting  a  processor  capable  of  executing  XOM  code.  This 
section  initially  describes  a  modest  amount  of  hardware  in 
conjunction  with  a  virtual  machine  monitor  [12]  that  can 
be  used  to  run  simple  XOM  code.  By  simple  XOM  code, 
we  mean  code  that  cannot  be  interrupted  and  does  not 
store  trusted  data  in  external  memory.  Next,  we  augment 
this  hardware  to  implement  a  machine  that  efficiently  sup¬ 
ports  the  full  XOM  execution  model,  which  handles  external 
memory  and  interrupts. 

4.1  A  Simple  XOM  Machine 

A  simplified  abstract  machine  that  does  not  support  ex¬ 
ternal  memory  or  interrupts  can  be  used  to  create  copy¬ 
protected  code.  Since  the  XOM  code  in  this  model  is  quite 
restricted,  we  propose  a  software  model  where  most  of  the 
application  runs  unencrypted  and  only  certain  sensitive  sec¬ 
tions  of  the  code  are  encrypted  and  run  in  XOM  mode.  In 
this  scenario,  the  XOM  sections  form  opaque  functions  that 
the  programmer  uses  to  secure  the  application.  These  func¬ 
tions  have  access  to  the  session  key,  which  is  used  to  encrypt 
or  decrypt  required  data  and  instructions.  Both  for  security 
and  simplicity,  the  XOM  code  segments  are  not  restartable 
if  interrupted.  If  an  exception  does  occur,  all  the  data  com¬ 
puted  in  the  XOM  code  segment  is  lost,  and  there  is  no 
mechanism  to  undo  any  actions  the  XOM  code  may  have 
performed.  Temporary  data  storage  for  the  XOM  code  could 
provided  by  a  tagged  on-chip  memory. 

There  are  several  possibilities  for  selecting  which  parts 
of  the  application  can  be  made  into  these  opaque  functions. 
The  ideal  candidate  for  XOM  is  a  piece  of  code  that  is  short, 
idempotent,  but  critical  to  the  usefulness  or  operation  of  the 
overall  code.  For  example,  in  the  case  of  a  streaming  media 
decoder,  short,  periodic  sections  of  the  media  stream  could 
be  encrypted,  and  the  XOM  functions  would  decrypt  the 
data  while  maintaining  the  secrecy  of  the  key. 

4.2  Creating  a  Simple  XOM  Machine 

This  simple  XOM  machine  can  be  implemented  almost 
completely  by  running  a  special  XOM  Virtual  Machine  Mon¬ 
itor  (called  XVMM)  on  a  slightly  modified  CPU.  The  main 
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Figure  1:  The  Simple  XOM  Architecture 


hardware  additions  to  the  CPU  include  special  microcode 
that  stores  the  private  key,  private  on-chip  memory, 
the  ability  to  trap  on  instruction  cache  misses  and  a  spe¬ 
cial  privileged  mode  under  which  the  XVMM  runs. 

A  block  diagram  illustrating  the  additional  architectural 
blocks  required  to  implement  the  simple  XVMM  is  shown 
in  Figure  1.  The  modifications  required  on  a  standard  pro¬ 
cessor  axe  outlined  in  bold. 

The  actual  XVMM  could  be  implemented  in  either  soft¬ 
ware  or  in  microcode.  Software  implementations  must  be 
authenticated  by  a  secure  booting  mechanism  such  as  those 
described  in  the  literature  [2,  26,  17].  Either  way,  the 
XVMM  executes  as  a  trusted,  authorized,  and  privileged 
program.  There  are  special  hardware  facilities  that  only  the 
XVMM  can  access,  such  as  the  private  key,  secure  on-chip 
memory  and  the  revectoring  of  certain  interrupts.  This  is 
why  the  XVMM  must  run  at  a  privilege  level  higher  than 
that  of  the  untrusted  operating  system. 

For  the  simple  execution  model,  the  XVMM  must  sup¬ 
port:  decryption  of  the  instruction  stream,  tagged  data 
within  the  machine,  and  the  four  instructions  enier.xom, 
exit-xorrij  mv-to-null  and  mv-from-null 

Decryption  of  Instructions 

Decrypting  the  instruction  stream  is  straightforward  to 
achieve  if  the  CPU  vectors  I-cache  misses  to  the  XVMM, 
which  can  then  decrypt  the  code  and  insert  the  decrypted 
instructions  into  the  cache.  I-cache  miss  handling  in  soft¬ 
ware  is  not  typically  available  on  modern  processors,  but 
does  not  require  much  additional  hardware.  In  addition  to 
the  input  to  the  exception  logic,  all  the  machine  needs  is  a 
mechanism  that  allows  the  processor  to  write  data  into  the 
I-cache.  The  I-miss  trap  handler  would  fetch  the  requested 
cache  line,  decrypt  it  and  check  the  hash,  and  then  write 


this  data  into  the  I-cache.  Depending  on  the  desired  perfor¬ 
mance,  decryption  of  the  instructions  can  be  done  entirely  in 
software  by  the  XVMM  or  with  special-purpose  hardware. 

Tagging  Data 

Data  in  the  machine  (I-cache  and  registers)  can  be  tagged 
with  minimal  changes  to  the  hardware.  The  obvious  solution 
is  to  directly  add  a  hardware  tag  to  each  unit  of  hardware 
storage  that  requires  one.  However,  this  additional  hardware 
is  not  strictly  needed  for  the  simple  XOM  machine.  Instead 
of  explicitly  adding  hardware,  the  XVMM  could  simply  re¬ 
move  the  tagged  data  from  the  machine  on  every  (implicit 
or  explicit)  exit-xom  instruction.  For  the  simple  machine 
both  the  machine  registers  and  the  I-cache  would  be  flushed, 
since  no  XOM  state  is  needed  after  the  XOM  code  finishes. 
In  this  model,  a  shadow  register  file  is  needed  to  hold  pub¬ 
lic  data  required  for  the  special  XOM  instructions  such  as 
mv-to-fiull  and  mv^from-null 

The  limitation  of  this  model  is  that  the  machine  will  not 
cause  an  exception  if  the  data  has  the  wrong  XOM  tag — it 
only  prevents  the  protected  data  from  being  read.  If  XOM 
code  cannot  be  interrupted,  this  level  of  protection  is  suf¬ 
ficient,  because  interrupts  cause  an  implicit  exit-xom.  To 
ensure  that  protection  violations  are  trapped  requires  that 
register  state  be  extended  to  include  a  valid  bit.  If  the  valid 
bit  is  cleared,  the  contents  of  the  register  are  invalid,  and  a 
read  access  to  a  register  causes  a  trap.  Register  reads  pro¬ 
ceed  as  normal  if  the  valid  bit  is  set.  A  write  to  a  register 
always  succeeds  and  sets  the  valid  bit.  The  valid  bit  can 
be  explicitly  tested  as  well  as  cleared  by  the  XVMM,  and 
it  indicates  that  the  requested  data  is  owned  by  the  active 
XOM  process. 

In  addition  to  the  registers,  the  machine  needs  to  contain 
private  memory  for  the  XVMM  to  use.  The  XVMM  memory 


needs  to  be  protected  so  it  can  only  be  accessed  by  the 
monitor,  while  any  storage  for  the  private  data  of  XOM 
applications  can  be  cleared  on  an  exit-xom. 

Private  Key 

The  last  addition  to  the  hardware  is  a  private  key  that  could 
be  contained  in  microcode  of  the  processor.  This  code  im¬ 
plements  the  asymmetric  cipher  and  outputs  the  symmetric 
key  needed  for  XOM  execution.  Because  the  private  key 
exists  as  microcode,  it  may  be  updated  by  a  secure  XOM 
program.  In  this  way,  an  authorized  party  could  create  sev¬ 
eral  chips  with  the  same  private  key.  This  would  be  required 
for  processor  upgrades  or  in  the  implementation  of  a  shared 
memory  multiprocessor. 

The  XOM  Virtual  Machine  Monitor  (XVMM) 

The  XVMM  implements  the  special  XOM  instructions  and 
provides  data  tagging  using  the  facilities  of  the  hardware 
described  above. 

If  the  registers  of  the  machine  are  not  explicitly  tagged, 
it  organizes  a  portion  of  the  private  memory  as  a  set  of 
tagged  registers.  For  each  general  purpose  register  in  the 
CPU,  private  memory  has  a  corresponding  shadow  register 
of  the  same  size  and  an  associated  XOM  identifier  tag  of  a 
suitable  length.  The  basic  idea  is  that  the  combination  of 
a  CPU  register  with  its  single  valid  bit,  the  corresponding 
shadow  register  and  tag  implemented  in  software  is  func¬ 
tionally  equivalent  to  having  CPU  registers  with  long  tags. 

A  separate  region  of  the  private  memory  holds  a  session 
key  table  containing  decrypted  session  keys  for  the  various 
XOM  identifiers.  The  XOM  tags  used  in  the  shadow  regis¬ 
ters  are  indices  into  this  key  table.  The  XVMM  keeps  track 
of  the  index  of  the  currently  executing  XOM  session.  Index 
zero  refers  to  the  null  tag,  which  refers  to  the  untrusted  null 
principal. 

The  XVMM  implements  the  four  special  XOM  mode  in¬ 
structions  as  follows. 

enter.xom:  The  XVMM  loads  the  session  key  of  the 
XOM  code  into  the  session  key  table  if  not  already  present. 
The  XVMM  maintains  a  128  bit  cryptographic  hash  of  the 
encrypted  session  key  along  with  its  decrypted  form.  The 
presence  check  is  performed  using  this  hash  value.  A  failed 
check  entails  an  asymmetric  key  decryption  operation  to 
generate  the  session  key.  Shadow  registers  whose  tags  match 
that  of  the  XOM  session  are  copied  into  their  corresponding 
CPU  registers,  which  are  then  marked  valid.  All  other  CPU 
registers  are  mcirked  invalid. 

The  XVMM  registers  a  handler  for  cache  miss  events  so 
that  I-cache  misses  incurred  during  the  execution  of  XOM 
code  will  be  correctly  vectored  to  it.  Similarly,  it  also  revec¬ 
tors  all  CPU  exceptions  and  interrupts  to  itself  so  that  it 
can  do  an  implicit  exit^xom  instruction  whenever  there  is  an 
interrupt  or  exception. 

exit-xom:  The  XVMM  unregisters  the  handler  for  cache 
miss  faults  and  restores  handlers  for  all  CPU  interrupts  and 
exceptions.  It  copies  all  shadow  registers  whose  tags  are 
null  into  the  corresponding  CPU  register.  All  other  CPU 
registers  are  marked  invalid.  If  this  exitjxom  is  a  result  of 
an  interrupt,  the  mutating  register  associated  with  the  in¬ 
terrupted  principal  is  updated. 

mv-to-nuU:  The  XVMM  checks  that  the  CPU  register 
has  the  valid  bit  set.  If  not,  it  raises  an  exception.  Oth¬ 
erwise,  it  moves  the  contents  of  the  CPU  register  into  the 


corresponding  shadow  register,  tags  the  shadow  register  as 
null,  and  marks  the  CPU  register  as  invalid. 

mv-frorri-null:  The  XVMM  checks  to  see  if  the  CPU  reg¬ 
ister  is  valid.  If  it  is,  then  it  raises  an  exception.  Otherwdse 
it  moves  the  contents  of  the  shadow^  register  into  the  CPU 
register  and  sets  the  valid  bit. 

4.3  Full  XOM  Machine 

To  extend  this  implementation  to  support  the  full  XOM 
machine  requires  support  for  secure  loads  and  stores  to  the 
external  memory,  as  well  as  saves  and  restores  of  protected 
registers.  The  hardware  consists  of  tag  bits  in  the  on-chip 
cache  to  store  the  XOM  identifier  for  XOM  data.  We  first 
describe  what  is  needed  to  support  context  switches,  and 
then  look  at  the  more  complex  problem  of  support  secure 
external  memory. 

If  XOM  code  is  interruptible,  an  adversary  may  mount 
a  spoofing  attack,  by  changing  values  while  the  code  is  in¬ 
terrupted.  To  guard  against  this,  the  XOM  machine  must 
raise  an  exception  if  a  principal  reads  data  from  a  different 
XOM  session.  This  would  detect  the  case  where  an  adver¬ 
sary  has  overwTitten  XOM  data  with  spoofed  data.  Thus, 
if  a  full  XOM  tag  is  not  used,  registers  need  to  be  protect¬ 
ed  minimally  by  a  valid  bit.  In  addition  a  mutating  key  is 
associated  with  each  session  key  to  protect  against  replay 
attacks.  The  XVMM  must  change  this  key  each  time  the 
XOM  process  is  interrupted.  The  only  other  support  need¬ 
ed  is  for  XVMM  code  to  implement  the  save.secure  and 
restorers ecure  instructions. 

save^secure:  This  instruction  takes  one  source  register 
and  uses  a  set  of  special  registers  as  destinations.  First,  the 
register  value  is  encrypted  using  the  session  key  correspond¬ 
ing  to  the  register  tag  and  is  placed  in  the  first  destination 
register.  Next,  a  128  bit  hash  is  calculated  based  on  the 
register  contents,  the  mutating  key,  and  the  register  num¬ 
ber.  We  assume  that  our  machine  has  64  bit  registers  and 
so  the  hash  is  placed  in  the  next  two  registers.  Finally,  the 
XOM  register  tag  is  stored  in  the  fourth  register.  Thus,  w^e 
need  four  special  registers  to  support  this  operation.  The 
destination  registers  are  all  tagged  with  the  identifier  of  the 
principal  that  called  save^secure.  The  calling  principal  is 
now  free  to  save  these  values  as  it  would  any  other  data  it 
owns. 

restores ecure:  This  instruction  is  simply  the  reverse  of 
savesecure.  It  takes  a  destination  register  as  an  argument, 
and  reads  the  contents  of  the  four  special  registers  used  in 
savesecure  to  determine  the  value  to  place  in  the  destination 
register.  The  XOM  identifier  in  the  fourth  register  is  used  to 
decrypt  the  data  in  the  first  register.  This  data,  along  with 
the  register  number  of  the  destination  register,  is  then  used 
to  regenerate  the  hash,  which  is  compared  with  the  contents 
of  the  second  and  third  registers.  If  the  hash  matches,  the 
decoded  data  is  placed  in  the  destination  register,  and  its 
XOM  tag  is  set  to  the  value  in  the  fourth  source  register. 

Supporting  secure  memory  wdll  require  the  encryption  and 
decryption  of  data  that  is  loaded  or  stored.  As  Section  6  will 
show,  even  with  additional  hardware  this  operation  is  not 
cheap.  For  performance  reasons  we  want  to  cache  the  results 
of  these  operations  as  much  as  possible,  which  drives  us  to 
move  the  on-chip  caches  into  the  protected  XOM  machine. 
With  this  arrangement  the  machine  only  enciphers  data  that 
leaves  the  chip.  Thus,  the  protection  boundary  is  expanded 
to  include  the  caches  as  shown  in  Figure  2. 
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Figure  2:  The  Full  XOM  Architecture 


Adding  the  caches  to  the  protected  XOM  machine  does 
cause  some  complications  since  we  do  not  want  the  over¬ 
heads  of  adding  XOM  tags  per  word,  or  generating  a  hash 
for  each  memory  word.  Instead,  we  want  to  tag  and  hash 
larger  blocks  of  data. 

We  protect  data  in  the  caches  by  adding  a  tag  for  a  XOM 
identifier  to  each  cache  line,  and  adding  a  valid  bit  per  word. 
When  a  line  is  fetched  as  a  result  of  a  secure  operation,  the 
contents  of  that  line  and  its  hash  are  decrypted  and  com¬ 
pared.  If  the  hash  check  succeeds,  the  tag  for  that  line  is 
updated  to  the  active  XOM  identifier,  the  data  is  written 
into  the  line,  and  all  the  valid  bits  are  set.  The  more  in¬ 
teresting  case  is  what  happens  when  the  hash  check  fails. 
In  this  case  we  still  set  the  line’s  tag  to  the  active  XOM 
identifier,  but  we  clear  all  the  valid  bits.  An  exception  oc¬ 
curs  either  if  a  loadsecure  reads  an  invalid  word,  or  if  it 
reads  a  line  whose  tag  does  not  match  the  active  identifier. 
Note  that  even  though  there  is  a  valid  bit  per  word,  data  is 
encrypted  or  decrypted  on  a  per  cache  line  basis. 

The  use  of  valid  bits  allows  a  XOM  process  to  perform  a 
store^secure  into  a  line  that  currently  holds  old  data  from 
another  compartment.  The  store  will  first  cause  the  line  to 
be  fetched  into  the  cache.  The  hash  will  not  match,  since 
it  is  for  a  different  XOM  session,  and  all  the  data  will  be 
marked  invalid.  The  store  will  set  the  valid  bit  for  the  word 
written — so  as  long  as  the  processor  only  reads  data  it  has 
already  written,  no  exception  will  occur.  If  the  requested 
line  for  the  store  is  already  in  the  cache,  but  the  tags  do  not 
match,  a  similar  operation  occurs.  Writing  into  a  cache  line 
always  updates  the  tag  to  the  active  XOM  identifier.  If  this 
changes  the  tag  on  the  line,  all  the  valid  bits  are  cleared, 
except  for  the  data  word  that  was  updated.  Note  that  the 
clearing  of  valid  bits  is  conservative — while  it  might  lose 
valid  data  if  two  XOM  principals  share  a  cache  line,  it  will 


never  allow  valid  XOM  data  to  leak  out  of  its  compartment. 
To  prevent  data  loss,  the  operating  system  must  prohibit  two 
XOM  applications  from  writing  data  into  the  same  physical 
page. 

The  XOM  tag  and  valid  bits  allow  us  to  push  the  encryp¬ 
tion/decryption  operations  to  operate  only  on  second-level 
cache  misses,  and  guarantees  that  each  cache  line  has  only 
one  active  XOM  identifier.  Thus,  the  MAC  can  be  done  on 
the  whole  cache  line  rather  than  each  word,  which  greatly 
reduces  the  memory  overhead  needed  for  storing  the  MAC. 
For  example,  a  128  byte  cache  line  could  easily  accommo¬ 
date  a  128  bit  hash  using  the  extra  bits  normally  allocated 
for  ECC.  The  hash  will  still  be  able  to  catch  almost  all  mem¬ 
ory  errors,  but  will  not  be  able  correct  for  single  bit  errors. 
The  added  complexity  is  that  the  valid  bits  must  be  stored 
along  with  the  data  to  allow  the  cache  to  flush  a  partially 
valid  line.  When  a  cache  line  is  loaded,  the  hash  is  checked 
and  if  the  hash  is  correct,  the  valid  bits  are  restored. 

The  full  XOM  model  requires  slightly  more  extra  hard¬ 
ware  than  the  simple  model,  but  it  is  still  relatively  modest 
(except  for  the  needed  symmetric  cipher  acceleration  that  is 
described  in  the  next  section).  It  consists  of  adding  tag  and 
valid  bits  to  the  on-chip  caches,  the  extra  control  logic  need¬ 
ed  to  deal  with  these  bits,  and  any  cryptographic  hardware. 
The  need  and  cost  of  this  hardware  is  described  next. 

5.  SPECIALIZED  HARDWARE 

We  note  that  most  XOM  operations  are  not  performance 
critical.  Asymmetric  operations  only  occur  when  a  new 
XOM  program  is  started.  Similarly,  register  saves  and  re¬ 
stores  are  infrequent,  and  only  occur  on  interrupts  and  con¬ 
text  switches.  For  instance,  a  study  [18]  shows  that  com¬ 
mercial  applications  have  a  mean  execution  length  between 


10.000  and  49.000  instructions  between  each  context  switch. 
With  hardware  support  for  the  XOM  identifier,  the  main 
performance  issue  is  the  implementation  of  secure  external 
memory.  The  encryption,  decryption,  and  MAC  compu¬ 
tation  of  cache  lines  appears  on  the  critical  path  for  each 
second-level  cache  miss.  Supporting  the  full  XOM  model  is 
very  expensive  without  hardware  acceleration  of  these  oper¬ 
ations. 

The  most  commonly  used  symmetric  cipher  today  is 
DES  [20].  DES  wwks  by  iterating  the  plaintext  through  16 
rounds  of  computation  consisting  of  two  XOR  operations, 
a  table  lookup  and  bit  permutations.  Because  of  its  short 
key  length,  it  is  usually  run  three  times  on  a  data  block  to 
create  the  Triple  DES  cipher  [22,  25].  However,  there  arc 
efforts  underww  to  replace  DES  with  new^er  algorithms  that 
are  both  more  efficient  and  more  secure.  Among  the  most 
promising  are  the  AES  [27]  candidates  Rijndael  [7]  and  Ser¬ 
pent  [3].  Rijndael  is  a  block  cipher  that  only  has  10  rounds 
of  calculation  while  Serpent  has  32  rounds.  A  recent  study 
found  that  when  run  on  a  slightly  optimized  1  GHz  proces¬ 
sor,  Rijndael  offers  the  best  performance,  encrypting  92.6 
MB/s  [5].  This  translates  into  approximately  1  byte  every 
11  cycles.  To  decrypt  a  128  byte  second-level  cache  line 
would  require  1408  cycles,  which  is  too  high  an  overhead  to 
pay  for  every  second-level  cache  miss  or  write- back. 

We  can  mitigate  this  cost  by  adding  special  hardware  to 
perform  the  symmetric  cryptography.  The  maximum  rate 
at  which  this  hardware  must  be  able  to  decrypt  and  encrypt 
data  is  dictated  by  the  peak  bandwidth  of  the  second-level 
cache  to  memory  interface.  As  an  example,  the  next  genera¬ 
tion  x86  CPU,  the  Intel  “Willamette”  [16],  will  have  a  peak 
memory  bandwidth  of  3.2  GB/s  and  a  clock  speed  ranging 
from  1.2  GHz  to  over  1.6  GHz.  With  a  64  bit  memory  bus 
this  corresponds  to  data  being  placed  on  the  bus  every  3- 
4  processor  cycles,  and  w'ould  require  a  cryptographic  unit 
capable  of  keeping  up  with  this  rate. 

We  will  use  Triple  DES  as  an  example  of  a  symmetric 
block  cipher  that  can  be  implemented  as  specialized  hard¬ 
ware,  even  though  faster  ciphers  are  available.  Triple  DES 
takes  a  64  bit  block  and  performs  48  rounds  of  transfor¬ 
mations  on  it.  We  believe  that  it  is  possible  to  build  a 
DES  implementation  that  can  compute  tw'o  rounds  per  cy¬ 
cle,  but  we  conservatively  assume  that  only  one  round  can 
be  computed  [8].  Thus,  it  takes  48  cycles  to  decrypt  a  64  bit 
block.  Because  each  round  is  essentially  identical  to  every 
other  round,  DES  can  easily  be  pipelined.  A  fully  pipelined 
unit  would  require  48  DES  stages  and  could  produce  a  64 
bit  output  each  cycle.  Since  this  is  3-4  times  the  required 
rate,  one  only  needs  a  DES  unit  with  16  pipeline  stages  with 
each  stage  performing  three  rounds  iteratively.  Becau.se  each 
stage  consists  only  of  XOR  gates  and  bit  swizzles,  the  area 
of  such  a  unit  is  dominated  by  the  size  of  the  tables.  The 
tables  are  6  bit  lookups  producing  a  fixed  4  bit  output,  and 
each  pipeline  stage  requires  16  such  tables.  Thus,  in  total 
256  such  tables  arc  required.  Since  these  tables  are  simply 
ROM’s  they  can  be  implemented  in  a  very  dense  fashion. 

We  also  need  to  create  a  MAC  for  each  cache  line.  Unfor¬ 
tunately  these  secure  hash  functions  are  also  expensive  to 
compute.  As  an  example,  we  will  consider  an  HMAC  [15] 
implementation  using  MD5  [13].  MD5  is  a  one  way  hash 
function  that  takes  64  rounds,  each  performing  a  non-linear 
operations  followed  by  four  bitwise  additions  and  a  barrel 
shift  and  produces  a  128  bit  hash.  This  computation  is  com¬ 


parable  to  encrypting  the  data.  While  w^e  could  add  more 
custom  hardware  for  the  hash  (w'hich  w^ould  be  smaller  than 
the  symmetric  cryptography  hardware  since  it  does  not  need 
to  be  pipelined)  there  is  a  simpler  solution. 

We  can  exploit  the  fact  that  a  MAC  provides  much  more 
functionality  then  wc  require.  A  MAC  is  able  to  provide  au¬ 
thentication  for  messages  that  are  not  encrypted,  by  using 
a  hash  that  is  difficult  to  reverse.  Since  the  cache  lines  are 
encrypted,  we  are  free  to  use  a  reversible  hash  for  redundan¬ 
cy.  Because  the  adversary  does  not  know'  the  session  key, 
she  cannot  generate  a  valid  hash  of  any  message  she  creates. 
Thus,  we  may  pick  a  much  faster  hash  (such  as  CRC)  and 
append  that  to  the  cache  line  before  encrypting  with  the 
session  key.  A  CRC  hash  may  be  generated  in  parallel  with 
the  decryption,  and  thus  has  no  additional  cost. 

The  final  issue  deals  with  the  latency  for  verifying  the  hash 
value  is  correct.  Since  the  hash  depends  on  the  entire  line, 
a  simple  implementation  would  delay  returning  any  data  to 
the  processor  until  the  entire  line  w'as  fetch  onto  the  pro¬ 
cessor.  For  the  Triple  DES  example  this  would  double  the 
overhead  from  48  cycles  for  the  first  w'ord,  to  over  100  cycles, 
since  it  takes  3*18  cycles  to  fetch  all  the  data  in  a  128  byte  -I- 
128  bit  cache  line.  To  eliminate  this  additional  overhead  w'e 
can  return  the  requested  w'ord  first,  and  speculatively  start 
the  processor  w'hen  the  requested  word  is  decoded.  If  the 
hash  docs  not  verify  the  XOM  thread  will  abort,  so  we  do 
not  need  to  worry  about  being  able  to  redo  the  load.  All  w'e 
need  to  ensure  is  that  any  operations  that  allow'  information 
to  leak  out  of  the  machine  such  as  stores  cause  the  machine 
to  stall  until  the  check  is  complete. 

6.  PERFORMANCE  IMPLICATIONS 

We  sec  from  the  previous  section  that  most  of  the  perfor¬ 
mance  cost  of  XOM  w'ill  show'  up  as  an  additional  stage  in 
the  memory  pipeline.  We  can  model  this  cost  as  increased 
memory  access  time.  There  is  increased  latency  because  the 
XOM  stage  is  serialized  wdth  the  bus,  and  the  additional  en¬ 
cryption  or  decryption  takes  extra  cycles  to  complete.  It  is 
unlikely  that  an  out-of-order  processor  can  hide  this  latency 
since  the  memory  access  time  is  already  high.  To  the  first 
order,  we  can  compute  the  slow  down  factor  as  follow's: 

slow  down  =  +  L2^miss  x  (XOM  Jat  -^  memJat) 

comp  +  L2jmiss  x  mem  Jat 
_  ,  ,  XOM  Jat 

—  1  I  ;  “  y.  %  of  time  stalled  on  memory 

memJat 

Where  comp  is  the  amount  of  time  the  CPU  is  not  stalled 
on  a  memory  access,  L2-miss  is  the  number  of  second-level 
misses,  XOM  Jat  is  the  latency  of  a  XOM  symmetric  unit 
and  memJat  is  the  latency  of  a  memory  access.  As  w'e  can 
see,  the  tw'o  factors  that  affect  the  performance  is  the  aver¬ 
age  proportion  of  time  the  CPU  is  stalled  on  memory  per 
instruction  and  the  ratio  of  the  latency  of  the  XOM  sym¬ 
metric  units  to  the  memory  access  time.  The  memory  stall 
time  is  dependent  on  the  number  of  second-level  misses, 
which  depends  on  cache  size,  associativity,  and  application. 
The  XOM  overhead  is  clearly  most  important  to  applica¬ 
tions  that  are  already  dominated  by  memory  latency  (high 
second-level  miss  rates).  For  these  applications  the  slow' 
down  for  running  in  a  XOM  application  will  simply  be  one 
plus  the  ratio  of  the  XOM  latency  to  the  memory  latency. 
This  is  encouraging,  since  the  XOM  delay  should  scale  with 


processor  performance,  while  the  memory  access  time  is  scal¬ 
ing  more  slowly.  For  the  “Willamette”  processor  example, 
the  memory  delay  is  over  100  cycles,  so  for  a  completely 
memory  bound  application  the  slow  down  will  be  less  than 
50%  for  a  48  cycle  Triple  DES  implementation. 

7.  SUMMARY 

Supporting  code  that  can  be  executed,  but  not  copied, 
read,  or  changed  is  a  challenging  problem  when  one  con¬ 
siders  that  an  adversary  may  modify  the  operating  system 
running  the  code,  the  data  stored  in  memory,  or  the  machine 
the  code  runs  on.  This  paper  examines  the  implementation 
of  XOM,  a  system  that  protects  an  application’s  code  and 
data  using  cryptographic  techniques,  and  restricts  the  ma¬ 
chine  on  which  this  code  is  allowed  to  run.  Prom  a  security 
standpoint,  the  critical  issue  is  to  prevent  information  from 
leading  out  of  a  XOM  application.  We  accomplish  this  by 
placing  each  application  in  its  own  compartment  implement¬ 
ing  a  policy  that  causes  the  process  to  halt  if  it  tries  to  read 
data  that  is  not  in  its  compartment. 

Implementing  the  compartments  for  on-chip  data  is  rela¬ 
tively  straight-forward.  We  tag  the  data  storage  locations 
with  the  compartment  that  created  it,  and  cause  an  excep¬ 
tion  if  the  data  being  read  does  not  match  the  active  com¬ 
partment.  Protecting  off-chip  data  is  more  complex.  En¬ 
cryption  alone  only  prevents  the  data  from  being  read,  it 
does  not  prevent  it  from  being  changed.  Providing  a  hash 
with  each  piece  of  data  allows  us  to  guard  against  spoof¬ 
ing,  splicing,  and  even  replay  attacks.  By  associating  the 
hash  with  large  cache  lines,  we  can  accomplish  this  protec¬ 
tion  with  small  memory  overhead — using  the  bits  currently 
allocated  for  ECC. 

The  performance  and  hardware  cost  for  this  protection 
is  surprisingly  modest.  The  essential  pieces  of  hardware  to 
be  added  are  the  tags  on  the  on-chip  storage,  the  logic  as¬ 
sociated  with  the  tags  to  check  for  access  violations,  and 
the  hardware  support  for  the  symmetric  encryption  of  the 
memory  traffic.  The  rest  of  the  XOM  functionality  is  not 
performance  critical  and  can  easily  be  implemented  in  soft¬ 
ware,  either  in  a  virtual  machine  monitor,  or  in  microcode. 
If  a  program  could  be  secured  by  using  only  small  XOM 
functions,  then  very  little  new  hardware  is  needed  at  all. 

XOM  allows  one  to  create  a  machine  that  prevents  users 
from  copying  or  modifying  code,  but  raises  a  number  of  se¬ 
curity  and  privacy  issues.  These  issues  continue  to  be  inter¬ 
esting  areas  of  research. 
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